Clean up data
In this homework assignment, we will explore, analyze and model a
data set containing approximately 8000 records representing a customer
at an auto insurance company. We will build multiple linear regression
on the continuous variable TARGET_AMT and binary logistic
regression on the TARGET_FLAG using the training data to
predict the probability that a person will crash their car and also the
amount of money it will cost if the person does crash their car.
We are going to build several models using different techniques and variable selection. In order to best assess our predictive model, we created a test set within our training data, and split it along an 80/20 training/testing proportion, before applying the finalized models to a separate evaluation dataset that did not contain the target.
The insurance training dataset contains 8161 observations of 26 variables, each record represents a customer at an auto insurance company. The evaluation dataset contains 2141 observations of 26 variables. The descriptions of each column are below.
Each record has two response variables. The first response variable,
TARGET_FLAG, is a 1 or a 0. A “1” means that the person was
in a car crash. A zero means that the person was not in a car crash. The
second response variable is TARGET_AMT. This value is zero
if the person did not crash their car. But if they did crash their car,
this number will be a value greater than zero.
The training data can be previewed below. The
TARGET_FLAG column is the binary dependent variable
denoting if a car was in a crash (target = 1) or not (target = 0).
TARGET_AMT is a numeric dependent variable and represents
the amount of time the car spent on repairs in case of crash. The
minimum is 0 (car wasn’t in crash, no time spent on repairs), the
maximum is 107586.1.
The table below provides valuable descriptive statistics about the
training data. 14 variables are categorical, 12 variables are numeric.
There is no missing data for categorical variables while numeric
variables YOJ (years on job) has 5% of missing data,
CAR_AGE (vehicle age) has 6%, and AGE (age of
driver) has less than 1%. Most of the numeric variables have a minimum
of zero. Some numbers seem strange, we should deal with it later. For
example, CAR_AGE has the minimum value of -3. Some of the
variables are character though they should be numeric and vice versa.
Variable OLDCLAIM, BLUEBOOK,
HOME_VAL, INCOME have $ sign in front a
number, we should remove the sign and transform the variable to numeric.
Variables MSTATUS (Marital Status), EDUCATION,
JOB, CAR_TYPE, SEX AND
URBANICITY contain prefix z_ that should be
removed as well. ADD SOMETHING ELSE FOR SUMMARY
| Name | train_df |
| Number of rows | 8161 |
| Number of columns | 25 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| numeric | 15 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| PARENT1 | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| MSTATUS | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| SEX | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| EDUCATION | 0 | 1 | 3 | 12 | 0 | 5 | 0 |
| JOB | 0 | 1 | 6 | 12 | 0 | 9 | 0 |
| CAR_USE | 0 | 1 | 7 | 10 | 0 | 2 | 0 |
| CAR_TYPE | 0 | 1 | 3 | 11 | 0 | 6 | 0 |
| RED_CAR | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| REVOKED | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| URBANICITY | 0 | 1 | 19 | 19 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| TARGET_FLAG | 0 | 1.00 | 0.26 | 0.44 | 0 | 0 | 0 | 1 | 1.0 |
| TARGET_AMT | 0 | 1.00 | 1504.32 | 4704.03 | 0 | 0 | 0 | 1036 | 107586.1 |
| KIDSDRIV | 0 | 1.00 | 0.17 | 0.51 | 0 | 0 | 0 | 0 | 4.0 |
| AGE | 6 | 1.00 | 44.79 | 8.63 | 16 | 39 | 45 | 51 | 81.0 |
| HOMEKIDS | 0 | 1.00 | 0.72 | 1.12 | 0 | 0 | 0 | 1 | 5.0 |
| YOJ | 454 | 0.94 | 10.50 | 4.09 | 0 | 9 | 11 | 13 | 23.0 |
| INCOME | 445 | 0.95 | 61898.09 | 47572.68 | 0 | 28097 | 54028 | 85986 | 367030.0 |
| HOME_VAL | 464 | 0.94 | 154867.29 | 129123.77 | 0 | 0 | 161160 | 238724 | 885282.0 |
| TRAVTIME | 0 | 1.00 | 33.49 | 15.91 | 5 | 22 | 33 | 44 | 142.0 |
| BLUEBOOK | 0 | 1.00 | 15709.90 | 8419.73 | 1500 | 9280 | 14440 | 20850 | 69740.0 |
| TIF | 0 | 1.00 | 5.35 | 4.15 | 1 | 1 | 4 | 7 | 25.0 |
| OLDCLAIM | 0 | 1.00 | 4037.08 | 8777.14 | 0 | 0 | 0 | 4636 | 57037.0 |
| CLM_FREQ | 0 | 1.00 | 0.80 | 1.16 | 0 | 0 | 0 | 2 | 5.0 |
| MVR_PTS | 0 | 1.00 | 1.70 | 2.15 | 0 | 0 | 1 | 3 | 13.0 |
| CAR_AGE | 510 | 0.94 | 8.33 | 5.70 | -3 | 1 | 8 | 12 | 28.0 |
TARGET_FLAG variable. Class
1 takes 27% and class 0 takes 63% of the
target variable. As a result, we have unbalanced class distribution for
our target variable that we have to deal with, we have to take some
additional steps (bootstrapping, etc) before using logistic regression.
| Value | % |
|---|---|
| 0 | 0.74 |
| 1 | 0.26 |
The distribution of the second target variable
TARGET_AMT is right skewed, we will also transform the
variable to make it follow the normal distribution (log/BoxCox).
## `geom_smooth()` using formula = 'y ~ x'
First, we will remove prefixes z_ and $
together with the INDEX variable (identification
Variable).
Transform to factor variables TARGET_FLAG,
CAR_TYPE, CAR_USE, EDUCATION,
JOB, MSTATUS, PARENT1,
RED_CAR, REVOKED, SEX and
URBANICITY.
Transform to numeric variables INCOME,
HOME_VAL, BLUEBOOK, OLDCLAIM.
As we see below, there are no unwanted characters in the factor
variables. The JOB variable contains empty level
"", it was substituted with "Unknown".
RED_CAR levels will be “Yes/No” instead of “yes/no”.
## $TARGET_FLAG
## [1] "0" "1"
##
## $PARENT1
## [1] "No" "Yes"
##
## $CAR_TYPE
## [1] "Minivan" "Panel Truck" "Pickup" "Sports Car" "SUV"
## [6] "Van"
##
## $JOB
## [1] "Blue Collar" "Clerical" "Doctor" "Home Maker" "Lawyer"
## [6] "Manager" "Professional" "Student" "Unknown"
##
## $CAR_USE
## [1] "Commercial" "Private"
##
## $URBANICITY
## [1] "Highly Rural/ Rural" "Highly Urban/ Urban"
##
## $RED_CAR
## [1] "No" "Yes"
##
## $REVOKED
## [1] "No" "Yes"
##
## $MSTATUS
## [1] "No" "Yes"
##
## $EDUCATION
## [1] "<High School" "Bachelors" "High School" "Masters" "PhD"
##
## $SEX
## [1] "F" "M"
First, we’ll manually adjust two special cases of missing or outlier values.
In cases where YOJ is zero and INCOME is
NA, we’ll set INCOME to zero to avoid imputing new values
over legitimate instances of non-employment.
## JOB
## 1 Home Maker
## 2 Student
## 36 Unknown
There is also at least one value of CAR_AGE that is less
than zero - we’ll assume this is a data collection error and set it to
zero (representing a brand-new car.)
We’ll use MICE to impute values for our remaining variables with
missing values - AGE, YOJ,
CAR_AGE, INCOME and HOME_VALUE.
We might reasonably assume there are relationships between them (older,
more years on the job may correlate with higher income and home value).
Taking simple means or medians might suppress those features, but MICE
should provide a better imputation.
Log transformation will be applied to variables INCOME,
TARGET_AMT, OLDCLAIM to transform their
distributions from right-skewed to the normal.
BoxCox transformation will be applied to variables
BLUEBOOK, TRAVTIME, TIF, so they
follow the normal distribution.
Binning values for CAR_AGE, HOME_VAL and
TIF:
Creating dummy variables for factors with two levels:
## [1] "TARGET_FLAG" "TARGET_AMT" "KIDSDRIV" "AGE"
## [5] "HOMEKIDS" "YOJ" "INCOME" "HOME_VAL"
## [9] "EDUCATION" "JOB" "TRAVTIME" "BLUEBOOK"
## [13] "TIF" "CAR_TYPE" "OLDCLAIM" "CLM_FREQ"
## [17] "MVR_PTS" "CAR_AGE" "CAR_AGE_BIN" "HOME_VAL_BIN"
## [21] "TIF_BIN" "MALE" "MARRIED" "LIC_REVOKED"
## [25] "CAR_RED" "PRIVATE_USE" "SINGLE_PARENT" "URBAN"
## [29] "CAR_CRASH"
## Warning in data.table::melt(.): The melt generic in data.table has been passed
## a data.frame and will attempt to redirect to the relevant reshape2 method;
## please note that reshape2 is deprecated, and this redirection is now deprecated
## as well. To continue using melt methods from reshape2 while both libraries are
## attached, e.g. melt.list, you can prepend the namespace like reshape2::melt(.).
## In the next version, this warning will become an error.
## No id variables; using all as measure variables
## Warning: attributes are not identical across measure variables; they will be
## dropped
(Move this split to after all transforms)
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.450670e+00
## KIDSDRIV 8.242256e-01
## AGE -1.590170e-03
## HOMEKIDS 2.966269e-02
## YOJ -1.843188e-02
## INCOME -6.032096e-06
## HOME_VAL .
## EDUCATIONBachelors -6.083644e-01
## EDUCATIONHigh School 6.004131e-02
## EDUCATIONMasters -4.377078e-01
## EDUCATIONPhD -1.234115e-01
## JOBClerical 1.422133e-01
## JOBDoctor -9.728426e-01
## JOBHome Maker .
## JOBLawyer -1.137389e-01
## JOBManager -1.529192e+00
## JOBProfessional -1.013136e-01
## JOBStudent -1.994015e-01
## JOBUnknown -6.904978e-01
## TRAVTIME 6.940606e-02
## BLUEBOOK -7.084013e-03
## TIF -1.700780e-01
## CAR_TYPEPanel Truck 6.532344e-01
## CAR_TYPEPickup 7.223041e-01
## CAR_TYPESports Car 1.591797e+00
## CAR_TYPESUV 1.135741e+00
## CAR_TYPEVan 7.850491e-01
## OLDCLAIM 4.862112e-02
## CLM_FREQ 7.523577e-02
## MVR_PTS 2.709291e-01
## CAR_AGE -7.336134e-03
## CAR_AGE_BIN .
## HOME_VAL_BIN -3.339554e-01
## TIF_BIN -1.534265e-01
## MALE1 4.100768e-02
## MARRIED1 -8.001639e-01
## LIC_REVOKED1 1.591403e+00
## CAR_RED1 3.091972e-02
## PRIVATE_USE1 -1.677192e+00
## SINGLE_PARENT1 1.042326e+00
## URBAN1 3.650730e+00
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.123592e+00
## KIDSDRIV 5.781403e-01
## AGE .
## HOMEKIDS .
## YOJ .
## INCOME -6.145362e-06
## HOME_VAL .
## EDUCATIONBachelors .
## EDUCATIONHigh School 2.565739e-01
## EDUCATIONMasters .
## EDUCATIONPhD .
## JOBClerical .
## JOBDoctor .
## JOBHome Maker .
## JOBLawyer .
## JOBManager -1.060109e+00
## JOBProfessional .
## JOBStudent .
## JOBUnknown .
## TRAVTIME 4.069176e-02
## BLUEBOOK -5.715392e-03
## TIF -1.354175e-01
## CAR_TYPEPanel Truck .
## CAR_TYPEPickup .
## CAR_TYPESports Car 5.614647e-01
## CAR_TYPESUV 2.868858e-01
## CAR_TYPEVan .
## OLDCLAIM 6.328528e-02
## CLM_FREQ 1.384936e-03
## MVR_PTS 2.502005e-01
## CAR_AGE -1.463889e-02
## CAR_AGE_BIN .
## HOME_VAL_BIN -3.804650e-01
## TIF_BIN -6.383520e-02
## MALE1 .
## MARRIED1 -4.457439e-01
## LIC_REVOKED1 1.321738e+00
## CAR_RED1 .
## PRIVATE_USE1 -1.384149e+00
## SINGLE_PARENT1 1.010721e+00
## URBAN1 2.952228e+00
##
## Call: cv.glmnet(x = X, y = Y, nfolds = 5, family = "gaussian", standardize = TRUE, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.01331 51 25.33 0.4037 37
## 1se 0.16407 24 25.73 0.4060 20
## $mse
## lambda.1se
## 25.50668
## attr(,"measure")
## [1] "Mean-Squared Error"
##
## $mae
## lambda.1se
## 4.166593
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -48241.06
##
## $BIC
## [1] -47990.49
## $AICc
## [1] -44758.86
##
## $BIC
## [1] -44623.31
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.450670e+00
## KIDSDRIV 8.242256e-01
## AGE -1.590170e-03
## HOMEKIDS 2.966269e-02
## YOJ -1.843188e-02
## INCOME -6.032096e-06
## HOME_VAL .
## EDUCATIONBachelors -6.083644e-01
## EDUCATIONHigh School 6.004131e-02
## EDUCATIONMasters -4.377078e-01
## EDUCATIONPhD -1.234115e-01
## JOBClerical 1.422133e-01
## JOBDoctor -9.728426e-01
## JOBHome Maker .
## JOBLawyer -1.137389e-01
## JOBManager -1.529192e+00
## JOBProfessional -1.013136e-01
## JOBStudent -1.994015e-01
## JOBUnknown -6.904978e-01
## TRAVTIME 6.940606e-02
## BLUEBOOK -7.084013e-03
## TIF -1.700780e-01
## CAR_TYPEPanel Truck 6.532344e-01
## CAR_TYPEPickup 7.223041e-01
## CAR_TYPESports Car 1.591797e+00
## CAR_TYPESUV 1.135741e+00
## CAR_TYPEVan 7.850491e-01
## OLDCLAIM 4.862112e-02
## CLM_FREQ 7.523577e-02
## MVR_PTS 2.709291e-01
## CAR_AGE -7.336134e-03
## CAR_AGE_BIN .
## HOME_VAL_BIN -3.339554e-01
## TIF_BIN -1.534265e-01
## MALE1 4.100768e-02
## MARRIED1 -8.001639e-01
## LIC_REVOKED1 1.591403e+00
## CAR_RED1 3.091972e-02
## PRIVATE_USE1 -1.677192e+00
## SINGLE_PARENT1 1.042326e+00
## URBAN1 3.650730e+00
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.716098e+00
## KIDSDRIV 4.114268e-01
## AGE -1.274043e-03
## HOMEKIDS 2.779506e-02
## YOJ -5.912864e-03
## INCOME -3.964779e-06
## HOME_VAL .
## EDUCATIONBachelors -3.129844e-01
## EDUCATIONHigh School 1.563180e-02
## EDUCATIONMasters -2.467550e-01
## EDUCATIONPhD -4.260311e-02
## JOBClerical 6.781448e-02
## JOBDoctor -4.680465e-01
## JOBHome Maker -2.933138e-02
## JOBLawyer .
## JOBManager -7.506532e-01
## JOBProfessional -3.377667e-02
## JOBStudent -1.395579e-01
## JOBUnknown -2.049771e-01
## TRAVTIME 3.869096e-02
## BLUEBOOK -4.230398e-03
## TIF -5.004348e-02
## CAR_TYPEPanel Truck 4.776375e-01
## CAR_TYPEPickup 4.281607e-01
## CAR_TYPESports Car 8.755467e-01
## CAR_TYPESUV 6.355374e-01
## CAR_TYPEVan 5.122177e-01
## OLDCLAIM 2.504929e-02
## CLM_FREQ 3.513050e-02
## MVR_PTS 1.068372e-01
## CAR_AGE -2.727552e-03
## CAR_AGE_BIN .
## HOME_VAL_BIN -1.660078e-01
## TIF_BIN -1.261495e-01
## MALE1 5.464939e-03
## MARRIED1 -4.532368e-01
## LIC_REVOKED1 6.960323e-01
## CAR_RED1 3.006562e-02
## PRIVATE_USE1 -8.083996e-01
## SINGLE_PARENT1 3.882197e-01
## URBAN1 2.285270e+00
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.464448e+00
## KIDSDRIV 3.295352e-01
## AGE -1.769623e-04
## HOMEKIDS 1.048500e-02
## YOJ .
## INCOME -3.680548e-06
## HOME_VAL .
## EDUCATIONBachelors -1.070104e-01
## EDUCATIONHigh School 7.383992e-02
## EDUCATIONMasters -3.456691e-02
## EDUCATIONPhD .
## JOBClerical 3.677502e-02
## JOBDoctor -7.665609e-02
## JOBHome Maker .
## JOBLawyer .
## JOBManager -5.667630e-01
## JOBProfessional .
## JOBStudent .
## JOBUnknown .
## TRAVTIME 2.797189e-02
## BLUEBOOK -3.228510e-03
## TIF -5.098952e-02
## CAR_TYPEPanel Truck .
## CAR_TYPEPickup 6.408488e-02
## CAR_TYPESports Car 4.575649e-01
## CAR_TYPESUV 2.774664e-01
## CAR_TYPEVan 3.657761e-02
## OLDCLAIM 2.599506e-02
## CLM_FREQ 1.974628e-02
## MVR_PTS 9.972623e-02
## CAR_AGE -7.803403e-03
## CAR_AGE_BIN .
## HOME_VAL_BIN -1.642793e-01
## TIF_BIN -8.078956e-02
## MALE1 .
## MARRIED1 -3.288395e-01
## LIC_REVOKED1 6.046503e-01
## CAR_RED1 .
## PRIVATE_USE1 -7.491766e-01
## SINGLE_PARENT1 3.783721e-01
## URBAN1 1.913118e+00
##
## Call: cv.glmnet(x = X, y = Y, nfolds = 5, family = "binomial", link = "logit", standardize = TRUE, alpha = 1)
##
## Measure: Binomial Deviance
##
## Lambda Index Measure SE Nonzero
## min 0.001033 51 0.9039 0.010248 37
## 1se 0.007285 30 0.9132 0.009009 28
## $deviance
## lambda.1se
## 0.9055072
## attr(,"measure")
## [1] "Binomial Deviance"
##
## $class
## lambda.1se
## 0.2133885
## attr(,"measure")
## [1] "Misclassification Error"
##
## $auc
## [1] 0.810068
## attr(,"measure")
## [1] "AUC"
##
## $mse
## lambda.1se
## 0.2944212
## attr(,"measure")
## [1] "Mean-Squared Error"
##
## $mae
## lambda.1se
## 0.6091211
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -1628.926
##
## $BIC
## [1] -1378.357
## $AICc
## [1] -1565.658
##
## $BIC
## [1] -1375.96
## 41 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.716098e+00
## KIDSDRIV 4.114268e-01
## AGE -1.274043e-03
## HOMEKIDS 2.779506e-02
## YOJ -5.912864e-03
## INCOME -3.964779e-06
## HOME_VAL .
## EDUCATIONBachelors -3.129844e-01
## EDUCATIONHigh School 1.563180e-02
## EDUCATIONMasters -2.467550e-01
## EDUCATIONPhD -4.260311e-02
## JOBClerical 6.781448e-02
## JOBDoctor -4.680465e-01
## JOBHome Maker -2.933138e-02
## JOBLawyer .
## JOBManager -7.506532e-01
## JOBProfessional -3.377667e-02
## JOBStudent -1.395579e-01
## JOBUnknown -2.049771e-01
## TRAVTIME 3.869096e-02
## BLUEBOOK -4.230398e-03
## TIF -5.004348e-02
## CAR_TYPEPanel Truck 4.776375e-01
## CAR_TYPEPickup 4.281607e-01
## CAR_TYPESports Car 8.755467e-01
## CAR_TYPESUV 6.355374e-01
## CAR_TYPEVan 5.122177e-01
## OLDCLAIM 2.504929e-02
## CLM_FREQ 3.513050e-02
## MVR_PTS 1.068372e-01
## CAR_AGE -2.727552e-03
## CAR_AGE_BIN .
## HOME_VAL_BIN -1.660078e-01
## TIF_BIN -1.261495e-01
## MALE1 5.464939e-03
## MARRIED1 -4.532368e-01
## LIC_REVOKED1 6.960323e-01
## CAR_RED1 3.006562e-02
## PRIVATE_USE1 -8.083996e-01
## SINGLE_PARENT1 3.882197e-01
## URBAN1 2.285270e+00
## True
## Predicted 0 1 Total
## 0 1102 244 1346
## 1 100 187 287
## Total 1202 431 1633
##
## Percent Correct: 0.7893